-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Persist string statistics data across multiple calls to orc chunked write #10694
Persist string statistics data across multiple calls to orc chunked write #10694
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10694 +/- ##
================================================
+ Coverage 86.40% 86.44% +0.04%
================================================
Files 143 143
Lines 22448 22493 +45
================================================
+ Hits 19396 19445 +49
+ Misses 3052 3048 -4
Continue to review full report at Codecov.
|
…o-chunked-stats-p2
ef13290
to
cf53ed4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very partial review to flush my comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few suggestions for the new benchmark.
Thanks for adding it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! But I think all our new benchmarks are supposed to use nvbench instead of Google benchmark (sorry to be that guy).
Not at all. Switched over to nvbench. |
rerun tests |
1 similar comment
rerun tests |
Well, this is odd.
It doesn't look like |
rapidsai/kvikio#69 should fix it. |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took so long to get to. Thanks for the pointers on writing benchmarks.
(P.S. The approval is on the C++ code. I didn't review the .py
files.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff! Few very minor suggestions.
@gpucibot merge |
@@ -724,6 +724,105 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq): | |||
assert stats_num_vals == actual_num_vals | |||
|
|||
|
|||
@pytest.mark.parametrize("stats_freq", ["STRIPE", "ROWGROUP"]) | |||
@pytest.mark.parametrize("nrows", [2, 100, 6000000]) | |||
def test_orc_chunked_write_statistics(tmpdir, datadir, nrows, stats_freq): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into this during a refactor, datadir
and stats_freq
aren't being used in this pytest. Do we want to keep them or remove them @hyperbolic2346 ?
This is the second half of the chunked orc write statistics work. This part enables persisting the string data between write calls, building the file-level statistics from the stripe data, and writing out the statistics in a chunked-write file. Care is made to ensure that everything is persisted by re-using the same variable in the added test for both writes to ensure nothing is missed. This was verified to invalidate the first table before the second call to write.
This will clean up once 10567 goes in as this is branched off that work.depends on #10567
closes #5826